We want to analyse climate change relative to the 1951-1980 base period.
First, we load the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies data for the Northern Hemisphere from NASA's Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here.
weather <-
read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv",
skip = 1,
na = "***")
We passed skip and na when reading the file to drop the header row and mark "***" as missing, and we select the year and the twelve month variables from the dataset, dropping the rest as they are not relevant.
tidyweather <- weather %>% select(1:13) %>% pivot_longer(cols=2:13, names_to = "Month", values_to="delta")
tidyweather
## # A tibble: 1,704 x 3
## Year Month delta
## <dbl> <chr> <dbl>
## 1 1880 Jan -0.35
## 2 1880 Feb -0.5
## 3 1880 Mar -0.23
## 4 1880 Apr -0.29
## 5 1880 May -0.05
## 6 1880 Jun -0.15
## 7 1880 Jul -0.18
## 8 1880 Aug -0.25
## 9 1880 Sep -0.23
## 10 1880 Oct -0.32
## # … with 1,694 more rows
Now we inspect the dataframe: it has three columns, Year, Month and delta, where delta shows how much the temperature varies from the base period.
We plot the data using a time-series scatter plot and add a trendline. We first create a new variable called date to ensure that the delta values are plotted chronologically.
# convert the date time datatype
tidyweather <- tidyweather %>%
mutate(date = ymd(paste(as.character(Year), Month, "1")),
month = month(date, label=TRUE),
year = year(date))
# plotting the scatter plot of the data
ggplot(tidyweather, aes(x=date, y = delta))+
geom_point()+
geom_smooth(color="red") +
theme_bw() +
labs (
title = "Weather Anomalies",
x = "Date",
y = "Temperature deviation"
)+
NULL
Now we will visualize the data on a per-month basis, to see whether the effect of increasing temperature is more pronounced in some months than in others.
#plotting the by-month scatter plot of data
ggplot(tidyweather, aes(x=date, y = delta))+
geom_point()+
geom_smooth(color="red") +
geom_hline(yintercept = 0, color="orange")+
theme_bw() +
labs (
title = "Weather Anomalies",
x = "Date",
y = "Temperature deviation"
)+
facet_wrap(~month)+
NULL
From the chart above we can see that the temperature deviation is smaller from May to August, as the data points are less spread out along the y-axis.
To study the historical data, we find it useful to group it into different time periods. Therefore, we create a new data frame called comparison that groups the data into five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.
We remove data from before 1881 using filter. Then we use mutate to create a new variable, interval, which records which period each observation belongs to; the periods are assigned with case_when().
comparison <- tidyweather %>%
filter(Year>= 1881) %>% #remove years prior to 1881
#create new variable 'interval', and assign values based on criteria below:
mutate(interval = case_when(
Year %in% c(1881:1920) ~ "1881-1920",
Year %in% c(1921:1950) ~ "1921-1950",
Year %in% c(1951:1980) ~ "1951-1980",
Year %in% c(1981:2010) ~ "1981-2010",
TRUE ~ "2011-present"
))
By clicking on it in the Environment pane, we inspected the dataframe; it has 7 columns: Year, Month, delta, date, month, year, and the interval variable we just created.
Now that we have the interval variable, we can create a density plot to study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in.
# Set `fill` to `interval` to group and colour the data by different time periods.
ggplot(comparison, aes(x=delta, fill=interval))+
geom_density(alpha=0.2) + #density plot with transparency set to 20%
theme_bw() + #theme
labs (
title = "Density Plot for Monthly Temperature Anomalies",
y = "Density" #changing y-axis label to sentence case
)
From this we can see that, as time passes, the average delta increases from negative to positive, indicating that the temperature is rising. We can also see that the temperature has been increasing at a faster rate since 1951.
We are also interested in average annual anomalies, therefore we further modified the data to produce a scatter plot as below:
#creating yearly averages
average_annual_anomaly <- tidyweather %>%
group_by(Year) %>% #grouping data by Year
# creating summaries for mean delta
# use `na.rm=TRUE` to eliminate NA (not available) values
summarise(annual_average_delta = mean(delta, na.rm=TRUE))
#plotting the data:
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
geom_point()+
#Fit the best fit line, using LOESS method
geom_smooth() +
#change to theme_bw() to have white background + black frame around plot
theme_bw() +
labs (
title = "Average Yearly Anomaly",
y = "Average Annual Delta"
)
As we can see from the plot, this matches our earlier conclusion that the temperature has been increasing at a faster rate, particularly since the 1960s.
NASA points out on their website that
A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.
We want to construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the infer package. Recall that the dataframe comparison has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present.
formula_ci <- comparison %>% filter(interval =="2011-present") %>%
summarise(annual_average_delta = mean(delta, na.rm=TRUE),
sd_delta = sd(delta, na.rm=TRUE),
count = n(),
se_delta = sd_delta/sqrt(count),
t_critical = qt(0.975, count-1),
margin_of_error = t_critical * se_delta,
delta_low = annual_average_delta - margin_of_error,
delta_high = annual_average_delta + margin_of_error)
#print out formula_CI
formula_ci
## # A tibble: 1 x 8
## annual_average_d… sd_delta count se_delta t_critical margin_of_error delta_low
## <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1.06 0.276 132 0.0240 1.98 0.0475 1.01
## # … with 1 more variable: delta_high <dbl>
# use the infer package to construct a 95% CI for delta
set.seed(1234)
boot_delta<- comparison %>%
filter(interval =="2011-present") %>%
specify(response = delta) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean") %>%
get_confidence_interval(level = 0.95, type = "percentile")
boot_delta
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.01 1.11
First we filtered the data to include only the years 2011-present. Then we calculated summary statistics including the mean, standard deviation and standard error, and from these we computed the confidence interval.
Both the bootstrap method and the formula method give a 95% confidence interval for delta of (1.01, 1.11). This means we are 95% confident that the true mean delta for 2011-present lies within (1.01, 1.11). Since the interval lies entirely above zero, we can confirm a net increase in temperature relative to the base period.
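As a sanity check, the formula interval can be re-derived by hand from the printed summary statistics (the mean 1.06, sd 0.276 and count 132 are read off the formula_ci output above):

```r
# Re-deriving the 95% t-interval from the summary statistics printed above.
xbar <- 1.06   # annual_average_delta from formula_ci
s    <- 0.276  # sd_delta from formula_ci
n    <- 132    # count from formula_ci

se  <- s / sqrt(n)             # standard error of the mean
t_c <- qt(0.975, df = n - 1)   # critical t value for a 95% CI
ci  <- c(xbar - t_c * se, xbar + t_c * se)
round(ci, 2)                   # approximately (1.01, 1.11)
```

This reproduces the (1.01, 1.11) interval reported by both methods.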
We will now start our analysis of Biden’s approval ratings. Fivethirtyeight.com has detailed data on all polls that track the president’s approval.
# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv')
glimpse(approval_polllist)
## Rows: 1,597
## Columns: 22
## $ president <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate <chr> "9/10/2021", "9/10/2021", "9/10/2021", "9/10/2021"…
## $ startdate <chr> "1/24/2021", "1/24/2021", "1/25/2021", "1/25/2021"…
## $ enddate <chr> "1/26/2021", "1/27/2021", "1/27/2021", "1/26/2021"…
## $ pollster <chr> "Rasmussen Reports/Pulse Opinion Research", "Maris…
## $ grade <chr> "B", "A", "B", "B", "B", "B", "B", "B+", "B", "B",…
## $ samplesize <dbl> 1500, 1313, 1500, 2200, 15000, 9212, 1500, 906, 15…
## $ population <chr> "lv", "a", "lv", "a", "a", "a", "lv", "a", "a", "a…
## $ weight <dbl> 0.3225, 2.1893, 0.3025, 0.1205, 0.2739, 0.1520, 0.…
## $ influence <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve <dbl> 48.0, 49.0, 49.0, 58.0, 54.0, 55.0, 50.0, 57.0, 54…
## $ disapprove <dbl> 48.0, 35.0, 48.0, 32.0, 31.0, 31.5, 45.0, 37.0, 32…
## $ adjusted_approve <dbl> 50.4, 50.0, 51.4, 56.5, 52.5, 53.5, 52.4, 56.3, 52…
## $ adjusted_disapprove <dbl> 41.9, 34.6, 41.9, 35.4, 34.4, 34.9, 38.9, 36.4, 35…
## $ multiversions <chr> NA, NA, NA, NA, NA, "*", NA, NA, NA, NA, NA, NA, N…
## $ tracking <lgl> TRUE, NA, TRUE, NA, TRUE, TRUE, TRUE, NA, TRUE, NA…
## $ url <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id <dbl> 74261, 74320, 74268, 74346, 74277, 74292, 74290, 7…
## $ question_id <dbl> 139433, 139558, 139483, 139653, 139497, 139518, 13…
## $ createddate <chr> "1/27/2021", "2/1/2021", "1/28/2021", "2/5/2021", …
## $ timestamp <chr> "18:35:08 10 Sep 2021", "18:35:08 10 Sep 2021", "1…
# Use `lubridate` to fix dates, as they are given as characters.
approval_polllist <- approval_polllist %>%
mutate(modeldate = mdy(modeldate),
startdate = mdy(startdate),
enddate = mdy(enddate),
createddate = mdy(createddate))
Now we will calculate the average net approval rate (approve - disapprove) for each week since Biden took office. We will plot Biden’s net approval along with its 95% confidence interval. For the date we use enddate, i.e., the date the poll ended.
In the bottom graph, the confidence intervals for week 3 and week 25 are very different. This could be due to a difference in sample size or to variation in responses; we believe it is more likely that week 3 has a much smaller sample size.
In the top graph we compared week 4 and week 25. There is little difference in the confidence intervals, suggesting the sample sizes for week 4 and week 25 are probably similar.
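The code for the weekly net-approval summary is not shown above; a minimal sketch of that step might look like the following. The column names (approve, disapprove, enddate) come from the glimpse above, but the tiny polls tibble here is made-up illustrative data, not real poll results.

```r
# Sketch: weekly mean net approval with a 95% t-interval.
library(dplyr)
library(lubridate)

# Made-up mini dataset standing in for approval_polllist.
polls <- tibble::tibble(
  enddate    = as.Date("2021-01-25") + c(0, 1, 2, 7, 8, 9),
  approve    = c(49, 51, 50, 53, 52, 54),
  disapprove = c(45, 43, 44, 40, 41, 39)
)

weekly_net <- polls %>%
  mutate(net_approval = approve - disapprove,
         week = isoweek(enddate)) %>%
  group_by(week) %>%
  summarise(mean_net = mean(net_approval),
            sd_net   = sd(net_approval),
            count    = n(),
            se_net   = sd_net / sqrt(count),
            t_crit   = qt(0.975, count - 1),
            lower    = mean_net - t_crit * se_net,
            upper    = mean_net + t_crit * se_net)
weekly_net
```

The real plot would feed a summary like this into ggplot with geom_line plus geom_ribbon (or geom_errorbar) for the interval.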
Now we look at the multiple data frames listed below:
# load gapminder HIV data
hiv <- read_csv(here::here("data","adults_with_hiv_percent_age_15_49.csv"))
life_expectancy <- read_csv(here::here("data","life_expectancy_years.csv"))
# get World bank data using wbstats
indicators <- c("SP.DYN.TFRT.IN","SE.PRM.NENR", "SH.DYN.MORT", "NY.GDP.PCAP.KD")
library(wbstats)
worldbank_data <- wb_data(country="countries_only", #countries only- no aggregates like Latin America, Europe, etc.
indicator = indicators,
start_date = 1960,
end_date = 2016)
# get a dataframe of information regarding countries, indicators, sources, regions, indicator topics, lending types, income levels, from the World Bank API
countries <- wbstats::wb_cachelist$countries
We now join the three dataframes (life_expectancy, worldbank_data and hiv) into one. Each dataset needs tidying first, after which we perform the join operations, choosing the join type that makes the most sense for the analysis.
hiv_long <- hiv %>%
pivot_longer(cols = 2:34, #columns 2 to 34
names_to = "Year",
values_to = "HIV_Value")
skim(hiv_long)
| Name | hiv_long |
| Number of rows | 5082 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 24 | 0 | 154 | 0 |
| Year | 0 | 1 | 4 | 4 | 0 | 33 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| HIV_Value | 1781 | 0.65 | 1.74 | 4.09 | 0.01 | 0.1 | 0.3 | 1.2 | 26.5 | ▇▁▁▁▁ |
hiv_long
## # A tibble: 5,082 x 3
## country Year HIV_Value
## <chr> <chr> <dbl>
## 1 Afghanistan 1979 NA
## 2 Afghanistan 1980 NA
## 3 Afghanistan 1981 NA
## 4 Afghanistan 1982 NA
## 5 Afghanistan 1983 NA
## 6 Afghanistan 1984 NA
## 7 Afghanistan 1985 NA
## 8 Afghanistan 1986 NA
## 9 Afghanistan 1987 NA
## 10 Afghanistan 1988 NA
## # … with 5,072 more rows
life_expectancy_long <- life_expectancy %>%
pivot_longer(cols = 2:302, #columns 2 to 302
names_to = "Year",
values_to = "Life_Expectancy_Value")
skim(life_expectancy_long)
| Name | life_expectancy_long |
| Number of rows | 56287 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 30 | 0 | 187 | 0 |
| Year | 0 | 1 | 4 | 4 | 0 | 301 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Life_Expectancy_Value | 759 | 0.99 | 53 | 21.7 | 1.01 | 32.3 | 48.7 | 74.2 | 94.8 | ▁▇▂▅▅ |
life_expectancy_long
## # A tibble: 56,287 x 3
## country Year Life_Expectancy_Value
## <chr> <chr> <dbl>
## 1 Afghanistan 1800 28.2
## 2 Afghanistan 1801 28.2
## 3 Afghanistan 1802 28.2
## 4 Afghanistan 1803 28.2
## 5 Afghanistan 1804 28.2
## 6 Afghanistan 1805 28.2
## 7 Afghanistan 1806 28.1
## 8 Afghanistan 1807 28.1
## 9 Afghanistan 1808 28.1
## 10 Afghanistan 1809 28.1
## # … with 56,277 more rows
#Merged HIV & Life_Expectancy data by matching Year and Country
hiv_life_expectancy <- hiv_long %>% inner_join( life_expectancy_long , by = c ( "country","Year" ))
#Renaming date column to Year on World Bank Data
worldbank_data <- rename(worldbank_data ,Year=date)
worldbank_data <- worldbank_data %>%
mutate(Year = as.character(Year))
#skim(hiv_life_expectancy)
#skim(worldbank_data)
#Merging Life Exp & HIV data with World bank data by Country & Year
hiv_life_expectancy_worldbank_data <- hiv_life_expectancy %>% inner_join(worldbank_data, by = c("country", "Year"))
hiv_life_expectancy_worldbank_data <- hiv_life_expectancy_worldbank_data %>%
#create new variable 'interval', and assign values based on criteria below:
mutate(interval = case_when(
Year %in% c(1979:1989) ~ "1979-1989",
Year %in% c(1990:2000) ~ "1990-2000",
Year %in% c(2001:2010) ~ "2001-2010",
#Year %in% c(1999:2009) ~ "1999-2009",
TRUE ~ "2011-present"
))
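We used inner_join because it keeps only the country-year pairs present in both tables, so every row of the merged data has values from all sources; a full (or left) join would instead keep unmatched rows padded with NAs. A toy illustration with made-up tibbles:

```r
library(dplyr)

# Two tiny made-up tables sharing the country/Year key columns.
a <- tibble::tibble(country = c("A", "B"), Year = c("2000", "2000"), x = 1:2)
b <- tibble::tibble(country = c("A", "C"), Year = c("2000", "2000"), y = 3:4)

inner_join(a, b, by = c("country", "Year"))  # 1 row: only country A matches
full_join(a, b, by = c("country", "Year"))   # 3 rows: B and C padded with NA
```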
ggplot(hiv_life_expectancy_worldbank_data, aes(x = HIV_Value , y = Life_Expectancy_Value)) +
geom_point() +
geom_smooth(method = "lm") +
labs(
title = "Relationship of HIV vs Life Expectancy",
x = "HIV Value",
y = "Life Expectancy"
)+
facet_wrap(~interval)+
NULL
We have faceted the graphs by interval. We see that as HIV prevalence increases, life expectancy decreases.
library(countrycode)
#Creating a dataset hiv_life_expectancy_worldbank_data_continent which has a column for continents corresponding to each country
hiv_life_expectancy_worldbank_data_continent <- hiv_life_expectancy_worldbank_data
hiv_life_expectancy_worldbank_data_continent <- cbind(hiv_life_expectancy_worldbank_data, new_col = "continent")
hiv_life_expectancy_worldbank_data_continent$continent <- countrycode(sourcevar = hiv_life_expectancy_worldbank_data_continent$country,
origin = "country.name",
destination = "continent")
ggplot(hiv_life_expectancy_worldbank_data_continent, aes(x = NY.GDP.PCAP.KD , y = SP.DYN.TFRT.IN)) +
geom_point() +
geom_smooth(method = "lm") +
labs(
title = "Relationship",
x = "GDP",
y = "Fertility"
)+
facet_wrap(~continent)+
NULL
As we can see from plotting GDP per capita against the fertility rate, the fertility rate drops substantially as GDP per capita increases, levelling off below a rate of about 3.
In Asia, Africa, the Americas and Oceania we see that the higher the GDP, the lower the fertility rate.
In Europe, however, the higher-GDP countries also tend to have higher fertility rates.
hiv_missing <- hiv_life_expectancy_worldbank_data %>%
mutate(na_yes_no = ifelse( is.na(HIV_Value) , "Yes" , "No" ) ) %>%
group_by(country,na_yes_no) %>%
summarise(count_missing=n()) %>%
filter( na_yes_no == "Yes" ) %>%
arrange(-count_missing)
hiv_missing[1:15,] %>%
# slice_max ( order_by = count_missing, n=5 ) %>%
ggplot(aes(x = count_missing, y = fct_reorder(country, count_missing))) +
geom_col(fill="orange")+
labs(x="No. of missing values" , y="Country" , title= "Missing values of HIV data per country")+
NULL
In each region, we now find the 5 countries that have seen the greatest improvement in child mortality, as well as the 5 countries where mortality rates have improved least or even deteriorated.
#now we filter the data from year 2011 and 1979
mortality_1979_2011 <- hiv_life_expectancy_worldbank_data_continent %>%
filter ( as.numeric(Year) %in% c( 1979,2011 ) )
#now we select the columns to consider
mortality_1979_2011 <- select( mortality_1979_2011 , c( "Year", "continent" , "country" , "SH.DYN.MORT" ))
#we rename the columns to add a character
mortality_1979_2011$Year = paste("y",mortality_1979_2011$Year,sep="")
#now we make the table wider to allow us to select the difference in values later
mortality_1979_2011_wide <- pivot_wider( data=mortality_1979_2011 , names_from = Year, values_from = SH.DYN.MORT )
#now we add an additional column to calculate the difference in mortality values
mortality_1979_2011_wide <- mortality_1979_2011_wide %>% mutate(mortality_diff = y2011 - y1979 )
#we group by the region and rank the mortality differences
mortality_1979_2011_ranked <- mortality_1979_2011_wide %>%
group_by(continent) %>%
summarise( country=country, asc_ranking = rank(mortality_diff) , dsc_ranking = rank(-mortality_diff) )
#mortality_diff = y2011 - y1979; since child mortality generally fell over this period, the most negative differences indicate the greatest improvement, while differences close to zero (or positive) indicate the least improvement
mortality_best5 <- mortality_1979_2011_ranked %>% filter( asc_ranking <=5 )
mortality_worst5 <- mortality_1979_2011_ranked %>% filter( dsc_ranking <=5 )
mortality_worst5 <-select( mortality_worst5 , c ( "continent" , "country"))
mortality_worst5
## # A tibble: 24 x 2
## # Groups: continent [5]
## continent country
## <chr> <chr>
## 1 Africa Botswana
## 2 Africa Central African Republic
## 3 Africa Lesotho
## 4 Africa Mauritius
## 5 Africa Zimbabwe
## 6 Americas Barbados
## 7 Americas Canada
## 8 Americas Costa Rica
## 9 Americas Cuba
## 10 Americas United States
## # … with 14 more rows
mortality_best5 <- select( mortality_best5 , c ( "continent" , "country"))
mortality_best5
ggplot(hiv_life_expectancy_worldbank_data_continent, aes(x = SP.DYN.TFRT.IN , y = SE.PRM.NENR)) +
geom_point() +
geom_smooth(method = "lm") +
labs(
title = "Relationship",
x = "Fertility",
y = "School Enrollment"
)+
facet_wrap(~continent)+
NULL
Once again we see that Europe and the other continents show different trends. In Asia, Africa, the Americas and Oceania, the higher the fertility rate, the lower the school enrollment.
In Europe, higher fertility rates are associated with higher school enrollment.
Recall the TfL data on how many bikes were hired every single day. We can get the latest data by running the following
url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"
# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2021-08-23T14%3A32%3A29/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20210911%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20210911T191327Z&X-Amz-Expires=300&X-Amz-Signature=fd99a32f8180134009d0aaf423a2c8bc4c5b513f66633ec7c330ddfe89f8775c&X-Amz-SignedHeaders=host]
## Date: 2021-09-11 19:14
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 173 kB
## <ON DISK> /var/folders/3z/z0hfc5056ns4j_nqcbwy64hw0000gn/T//RtmpD2zVC7/file76006f441d0a.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
sheet = "Data",
range = cell_cols("A:B"))
# change dates to get year, month, and week
bike <- bike0 %>%
clean_names() %>%
rename (bikes_hired = number_of_bicycle_hires) %>%
mutate (year = year(day),
month = lubridate::month(day, label = F),
week = isoweek(day))
We can easily create a facet grid that plots bikes hired by month and year.
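The code for that facet grid is not shown above; a minimal sketch is below. Since the real bike dataframe requires downloading the TfL file, bike_demo is a small made-up stand-in with the same columns.

```r
# Sketch of a facet grid of daily-hire distributions by month and year.
library(ggplot2)
library(dplyr)

# Made-up stand-in for the real `bike` dataframe (year, month, bikes_hired).
set.seed(42)
bike_demo <- tidyr::expand_grid(year = 2019:2020, month = 1:12) %>%
  tidyr::uncount(30) %>%                              # ~30 "days" per month
  mutate(bikes_hired = rnorm(n(), mean = 30000, sd = 5000))

p <- ggplot(bike_demo, aes(x = bikes_hired)) +
  geom_density() +
  facet_grid(year ~ month) +                          # rows = year, cols = month
  theme_bw() +
  labs(x = "Bikes hired", y = "Density")
```

Swapping bike_demo for the real bike dataframe (restricted to full years) would reproduce the grid described in the text.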
In May and June of 2020 there was a huge decline in bike rentals due to the pandemic.
We will now reproduce the following graph, which looks at the monthly change in TfL bike rentals relative to the monthly averages calculated over 2016-2019. The blue line is the mean bike rentals for each month over that period. The red shaded regions show months where rentals fell below the average, and the green regions show months where rentals were above it.
# Calculates the average monthly bikes rented using data from 2016 to 2021.
expected_hires <- bike %>%
filter(day>="2016-01-01")%>%
group_by(year, month) %>%
summarize(bikes_hired = mean(bikes_hired)) %>% #takes the daily data and creates a monthly mean for each year/month combo
group_by(month) %>%
summarise(expected_hired = mean(bikes_hired)) #outputs mean bike rentals by month (Jan-Dec) with only 12 rows
#modifying the dataset and adding the averages previously calculated in expected_hires
plot_bike <- bike %>%
filter(day>="2016-01-01")%>%
group_by(year, month) %>%
summarize(bikes_hired = mean(bikes_hired)) %>% #gets the average bikes for each year/month combo 1/2016, 2/2016 ....
inner_join(expected_hires, by = "month") %>% #adds column with the average bike rentals to each year/month combo
mutate(fill = bikes_hired>expected_hired, #TRUE/FALSE column: is this month above its average?
up = ifelse(bikes_hired>expected_hired, bikes_hired-expected_hired, 0), #excess above the average, otherwise 0
down = ifelse(bikes_hired>expected_hired, 0,bikes_hired-expected_hired), #shortfall below the average, otherwise 0
Month = month(month, label=T)) #month as a labelled factor
plot_bike$lower = apply(plot_bike[,3:4],1,min) # creates a column taking the smallest value from actual vs average bikes hired
plot_bike$higher = apply(plot_bike[,3:4],1,max) # creates a column taking the largest value from actual vs average bikes hired
plot_bike$date = ym(paste(plot_bike$year,plot_bike$month)) #creates column with date in ym format
#Recreating the plot
plot_bike %>%
ggplot(aes(x=Month)) +
geom_line(aes( y=expected_hired, group=year),colour="blue",size=2)+ #draws the average
geom_line(aes(y=bikes_hired, group=year),colour="black",size=.5)+ #draws the actual bikes hired
geom_ribbon(aes(ymin=expected_hired,ymax=expected_hired+up, group=year),fill="#7DCD85",alpha=0.4) + #creates green shaded
geom_ribbon(aes(ymin=expected_hired,ymax=expected_hired+down, group=year),fill="#CB454A",alpha=0.4)+ #creates red shaded
facet_wrap(~year)+ #creates plots for years
theme_bw() +
labs(title = "Monthly changes in TfL bike rentals",
subtitle = "change from monthly average shown in blue and calculated between 2016-2019", caption= "Source: TfL, London Data Store",
x="", y="Bike Rentals") +
NULL
The second graph we will recreate looks at the percentage change from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).
Here the green shaded region shows the percentage of rentals above the average and the red shows the percentage below.
#Calculating the weekly means
expected_hires_week <- bike %>%
filter(day>="2016-01-01" & day<"2020-01-01")%>%
group_by(year, week) %>%
summarize(bikes_hired = mean(bikes_hired)) %>% #takes the daily data and creates a weekly mean for each year/week combo
group_by(week) %>%
summarise(expected_hired = mean(bikes_hired)) #outputs mean bike rentals by week with 53 rows
#modifying the dataset and adding the averages previously calculated in expected_hires
plot_bike_week <- bike %>%
filter(day>="2016-01-01")%>%
filter(!(year==2021 & week==53)) %>% # gets rid of week 53 in 2021 causing weird line in plot.
group_by(year, week) %>%
summarize(bikes_hired = mean(bikes_hired)) %>%
inner_join(expected_hires_week, by = "week") %>% #joins the two datasets (one with mean) by week
mutate(fillcolor = bikes_hired>expected_hired,
excess_rentals = bikes_hired - expected_hired, #calculates rentals above average
percentage_change_expected = (excess_rentals/expected_hired), #calculates percentage above avg
up = ifelse(percentage_change_expected>0, (excess_rentals/expected_hired), 0),
down = ifelse(percentage_change_expected>0, 0,(excess_rentals/expected_hired)))
plot_bike_week$lower = apply(plot_bike_week[,3:4],1,min) # creates a column taking the smallest value from actual vs average bikes hired
plot_bike_week$higher = apply(plot_bike_week[,3:4],1,max) # creates a column taking the largest value from actual vs average bikes hired
plot_bike_week %>% ggplot(aes(x=week, y=percentage_change_expected)) +
annotate(geom="rect", xmin = 14,xmax = 26, ymin=-Inf, ymax=Inf, alpha=0.1) + #Q2
annotate(geom="rect", xmin = 40,xmax = 52, ymin=-Inf, ymax=Inf, alpha=0.1) + #Q4
geom_line(aes(x=week, y=percentage_change_expected)) + #draws the weekly % change line
geom_ribbon(aes(ymin=0,ymax=up, group=year),fill="#7DCD85",alpha=0.4) + #adds green shaded
geom_ribbon(aes(ymin=0,ymax=down, group=year),fill="#CB454A",alpha=0.4)+ #adds red shaded
geom_rug(aes(x=week), color=ifelse(plot_bike_week$fillcolor,"#7DCD85","#CB454A"), sides="b") +
facet_wrap(~year) +
scale_y_continuous(labels = scales::percent) + #adds percent on axis
theme_bw() +
theme(legend.position = "none") +
labs(title = "Weekly change in TfL bike rentals",
subtitle = "% change from weekly averages calculated between 2016-2019")
Should we use the mean or the median to calculate expected rentals? Why?
In our graphs we calculate the expected number of rentals per week or month over 2016-2019 and then see how each week or month of 2020-2021 compares to those expectations, via excess_rentals = actual_rentals - expected_rentals. We use the mean rather than the median because the expected number of rentals is by definition an average, and because daily bike rentals contain few extreme outliers, so the mean is a good representation of a typical period; with so little skew, the resulting graphs are essentially identical either way.
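A tiny illustration of the mean-versus-median point (made-up numbers, in thousands of rentals):

```r
# With roughly symmetric data the mean and median agree; a single outlier
# pulls the mean away while the median stays put.
symmetric <- c(28, 30, 32, 29, 31)
skewed    <- c(28, 30, 32, 29, 95)   # one extreme (made-up) outlier day

mean(symmetric); median(symmetric)   # both 30
mean(skewed); median(skewed)         # mean jumps to 42.8, median stays 30
```

Since TfL daily hires show nothing like that outlier pattern, the mean is a safe choice here.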
Team Members: Alex Kubbinga, Clara Moreno Sanchezu, Jean Huang, Raghav Mehta, Raina Doshi, Yuan Gao